As the number of dimensions in a dataset increases, visualising its structure and variable dependencies becomes more tedious. Scagnostics (scatterplot diagnostics) are a set of visual features that can be used to identify interesting and abnormal scatterplots, and thus to prioritise the variables we choose to visualise. Here, we discuss the creation of the cassowaryr R package, which provides a user-friendly method to calculate these scagnostics, as well as the development of adjusted measures not previously defined in the literature. The package is tested on datasets with known interesting visual features to ensure the scagnostics work as expected, before being applied to time series, physics and AFLW data to show their value as a preliminary step in exploratory data analysis.
Visualising high-dimensional data is often difficult and requires a trade-off between the usefulness of the plots and maintaining the structure of the original data. This is because the number of possible pairwise plots grows quadratically with the number of dimensions. Datasets like Anscombe’s quartet (Anscombe 1973) or the datasaurus dozen (Locke and D’Agostino McGowan 2018) have been constructed such that each pairwise plot has the same summary statistics but strikingly different visual features, illustrating the pitfalls of numerical summaries and the importance of visualisation. Despite the issues that come with increasing dimensionality, visualisation of the data cannot be ignored, and scagnostics offer one possible solution.
The term scagnostics was introduced by John Tukey in 1982 (Tukey 1988). Tukey discusses the value of a cognostic (a diagnostic that should be interpreted by a computer rather than a human) for filtering out uninteresting visualisations, and calls a cognostic specific to scatter plots a scagnostic. Up to a moderate number of variables, a scatter plot matrix (SPLOM) can be used to create pairwise visualisations; however, this solution quickly becomes infeasible. Thus, instead of trying to view every possible variable combination, the workload is reduced by calculating a series of visual features and only presenting the scatter plots that are outliers on these feature combinations.
There is a large amount of research into visualising high-dimensional data, most of which focuses on some form of dimension reduction. This can be done by creating a hierarchy of potential variables, performing a transformation of the variables, or some combination of the two. Unfortunately, none of these methods is without pitfalls. Linear transformations are subject to crowding, where low-dimensional projections concentrate data in the centre of the distribution, making it difficult to differentiate data points (Diaconis and Freedman 1984). Non-linear transformations often have complex parameterisations, and can break the underlying global structure of the data, creating misleading visualisations. There are solutions within these methods, such as the burning sage tour, which zooms in on points closer to the middle of a tour to prevent crowding (U. Laa, Cook, and Lee 2020), or the liminal package, which facilitates linked brushing between non-linear and linear data transformations to maintain global structure (Lee, Laa, and Cook 2020), but all these methods still involve some transformation of the data. Scagnostics give the benefit of allowing the user to view relationships between the variables in their raw form. This means they are not subject to the linear-transformation issue of crowding, or the non-linear-transformation issue of misleading global structures. That being said, viewing only pairwise plots can leave our variable interpretations without context. Methods such as those shown in ScagExplorer (Dang and Wilkinson 2014) try to address this by visualising the pairwise plots in relation to the distribution of the scagnostic measures, but ultimately the lack of context remains one of the limitations of using scagnostics alone as a dimension reduction technique.
Scagnostics are not only useful in isolation; they can be applied in conjunction with other techniques to find interesting feature combinations of the transformed variables. The tourr projection pursuit currently uses a selection of scagnostics to identify interesting low-dimensional projections and move the visualisation towards them (U. Laa and Cook 2020). Since scagnostics do not depend on the type of data, they can also be used to compare and contrast scatter plots regardless of discipline. In this way, they are a useful metric for comparisons like those described in A self-organizing, living library of time-series data, which organises time series by their features instead of their metadata (Fulcher et al. 2020).
Several scagnostics were defined in Graph-Theoretic Scagnostics (L. Wilkinson, Anand, and Grossman 2005), and these are typically considered the basis of the visual features. They were all constructed to range over [0,1], and later scagnostics have maintained this scale. The formulas for these measures were revised in Scagnostic Distributions, according to which they are still calculated (Leland Wilkinson and Wills 2008). In addition to the main nine, the benefit of two additional association scagnostics was discussed in Katrin Grimm’s PhD thesis (Grimm 2016). These two association measures are also used in the tourr projection pursuit (U. Laa and Cook 2020).
There are two existing scagnostics packages, scagnostics (Leland Wilkinson and Wills 2008) and the archived package binostics (Ursula Laa et al. 2020). Both are based on the original C++ code from Scagnostic Distributions (Leland Wilkinson and Wills 2008), which is difficult to read and difficult to debug. Thus there is a need for a new implementation that enables better diagnosis of the scagnostics, and better graphical tools for examining the results.
This paper describes the R package cassowaryr, which computes the currently existing scagnostics and adds several new measures. The paper is organised as follows. The next section explains the scagnostics. This is followed by a description of the implementation. Several examples using collections of time series and XXX illustrate the usage.
In order to capture the visual structure of the data, graph theory is used to calculate most of the scagnostics. The pairwise scatter plot is reconstructed as a graph with the data points as vertices and edges calculated using Delaunay triangulation. In the package, this calculation is done with the alphahull package (Pateiro-López and Rodríguez-Casal 2019) to construct an object called a scree. This is the basis for all the other objects used to calculate the scagnostics (except for monotonic, dcor and splines, which use the raw data). The graph (scree object) is then used to construct the three key structures on which the scagnostics are based: the convex hull, alpha hull and minimum spanning tree (MST) (Figure 1).
Convex hull: The outside vertices of the graph, connected to make a convex polygon that contains all points. It is constructed using the tripack package.
Alpha hull: A collection of boundaries that contain all the points in the graph. Unlike the convex hull, it does not need to be convex. It is calculated using the alphahull package (Pateiro-López and Rodríguez-Casal 2019).
MST: the minimum spanning tree, i.e., the set of edges with the smallest total length that connects all the points. In the package it is calculated from the graph using the igraph package (Csardi and Nepusz 2006).
Figure 1: The building blocks for graph-based scagnostics
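As a concrete illustration of the MST building block (and of the stringy ratio of degree-2 vertices used later), here is a minimal base-R sketch using Prim's algorithm on Euclidean distances. Note this is only an illustration: the package itself derives the MST from the Delaunay triangulation via igraph.

```r
# Minimal Prim's algorithm on Euclidean distances (illustration only;
# cassowaryr builds the MST from the Delaunay triangulation via igraph).
prim_mst <- function(xy) {
  n <- nrow(xy)
  d <- as.matrix(dist(xy))
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  edges <- matrix(numeric(0), ncol = 2)
  while (sum(in_tree) < n) {
    dd <- d[in_tree, !in_tree, drop = FALSE]
    k <- which(dd == min(dd), arr.ind = TRUE)[1, ]
    from <- which(in_tree)[k[1]]
    to <- which(!in_tree)[k[2]]
    edges <- rbind(edges, c(from, to))
    in_tree[to] <- TRUE
  }
  edges
}

# Five collinear points form a path: every internal vertex has degree 2,
# so the MST is maximally "stringy" (|V2| / (|V| - |V1|) = 1).
xy <- cbind(0:4, 0)
e <- prim_mst(xy)
deg <- tabulate(c(e), nbins = nrow(xy))
stringy <- sum(deg == 2) / (nrow(xy) - sum(deg == 1))
stringy  # 1 for a straight line of points
```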
The nine scagnostics defined in Scagnostic Distributions are detailed below with an explanation, formula, and visualisation. We let A = the alpha hull, C = the convex hull, M = the minimum spanning tree, and s = the scagnostic measure. Since some of the measures have a sample-size dependence, we let w be a constant that adjusts for this.
\[s_{convex}=w\frac{area(A)}{area(C)}\]
\[s_{skinny}= 1-\frac{\sqrt{4\pi area(A)}}{perimeter(A)}\]
\[s_{outlying}=\frac{length(M_{outliers})}{length(M)}\]
\[s_{stringy} = \frac{|V^{(2)}|}{|V|-|V^{(1)}|}\]
\[s_{skewed} = 1-w(1-\frac{q_{90}-{q_{50}}}{q_{90}-q_{10}})\]
\[s_{sparse}= wq_{90}\]
\[s_{clumpy}=\max_{j}\left[1-\frac{\max_{k}[length(e_k)]}{length(e_j)}\right]\]
\[s_{striated}=\frac{1}{|V|}\sum_{v \in V^{(2)}}I\left(\cos\theta_{e(v,a)e(v,b)}<-0.75\right)\]
\[s_{monotonic} = r^2_{spearman}\]
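Of the nine, monotonic is the simplest to reproduce directly: it is the squared Spearman rank correlation, which in base R is a one-liner (a sketch of the definition, not the package's implementation).

```r
# Monotonic scagnostic: squared Spearman correlation (base-R sketch)
sc_monotonic_sketch <- function(x, y) cor(x, y, method = "spearman")^2

x <- 1:20
m1 <- sc_monotonic_sketch(x, x^3)         # 1: perfectly monotone, though non-linear
m2 <- sc_monotonic_sketch(x, (x - 10)^2)  # near 0: non-monotone relationship
```

Note that a strong quadratic relationship scores near zero: monotonic, like Pearson correlation, only detects monotone association, which is exactly why the splines and dcor measures below are useful complements.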
The two additional scagnostics discussed by Katrin Grimm are described below.
\[s_{splines}=\max_{i\in x,y}[1-\frac{Var(Residuals_{model~i=.})}{Var(i)}]\]
\[s_{dcor}= \sqrt{\frac{\mathcal{V}(X,Y)}{\sqrt{\mathcal{V}(X,X)\,\mathcal{V}(Y,Y)}}}\]
where
\[\mathcal{V}(X,Y)=\frac{1}{n^2}\sum_{k=1}^n\sum_{l=1}^nA_{kl}B_{kl}\]
and
\[A_{kl}=a_{kl}-\bar{a}_{k.}-\bar{a}_{.l}+\bar{a}_{..}, \qquad B_{kl}=b_{kl}-\bar{b}_{k.}-\bar{b}_{.l}+\bar{b}_{..}\]
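The distance correlation can be sketched in a few lines of base R following the standard double-centring construction (a naive O(n²) version for illustration only; the package uses its own implementation).

```r
# Naive distance covariance: double-centre the pairwise distance matrices,
# then average their elementwise product.
dcov2 <- function(x, y) {
  centre <- function(d) sweep(sweep(d, 1, rowMeans(d)), 2, colMeans(d)) + mean(d)
  A <- centre(as.matrix(dist(x)))
  B <- centre(as.matrix(dist(y)))
  mean(A * B)  # (1 / n^2) * sum_kl A_kl * B_kl
}
sc_dcor_sketch <- function(x, y) {
  sqrt(dcov2(x, y) / sqrt(dcov2(x, x) * dcov2(y, y)))
}

x <- seq(0, 1, length.out = 30)
val <- sc_dcor_sketch(x, 2 * x)
val  # ~1: exact linear dependence
```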
Once we have working functions that correctly calculate the scagnostics according to their definitions, we can assess how well they identify the visual features of scatter plots. To test the package’s ability to differentiate plots, we have created a dataset called “features” (also included in the cassowaryr package) that contains a series of interesting and unique scatter plots on which we can run our scagnostics.
(#fig:Features plot)The Scatter Plots of the Features Dataset
These scatter plots typify certain visual features we want to look for in scatter plots, whether deterministic relationships (such as that shown in the nonlinear feature), discreteness in variables (vlines), or clustering (clumpy); we should be able to use scagnostics to identify each of these scatter plots. Below is a visual table showing an example of a high, a moderate, and a low value on each scagnostic. The scagnostics are supposed to range from 0 to 1; however, in some cases the values are so compressed that a moderate value would not fit, indicating that the scagnostics do not work quite as intended. We suspect the reason for these warped distributions is the removal of binning as a preliminary step in calculating the scagnostics. We wanted the package to have binning as an optional method, considering that choices in binning can lead to bias, as noted in “Scagnostic Distributions” (Leland Wilkinson and Wills 2008), or unreproducible results, as noted in “Improving the Robustness of Scagnostics” (Wang et al. 2020). Therefore the current scagnostics will be assessed without binning.
(#fig:Visual Table)The Features Scatterplots in a Visual Table
This plot gives an idea of the issues some of the scagnostics face in their current state. The scagnostics based upon the convex hull (i.e. skinny and convex) work fine, as do the association measures monotonic, dcor and splines. The main issues come from the measures based on the MST, and largely stem from the removal of binning. The MST measures and their issues are:
Striated: striated can identify the specific case of one discrete variable and one continuous variable (which alone is not particularly interesting) but will not identify two discrete variables. Since by definition the vertices it counts are a subset of those counted by stringy, the two measures are highly correlated, which means most variables that score highly on striated already score highly on stringy, making the measure less useful.
Sparse: While sparse does seem to identify spread-out distributions, it rarely returns a value higher than 0.1. This measure is the 90th percentile of MST edge lengths, and the removal of binning allows for a large number of arbitrarily small edges. In addition, a larger number of observations will arbitrarily make this value smaller: the addition of new points increases the number of small edges and decreases the number of large edges, and it is rare that a significantly large edge will be at or below the 90th percentile.
Skewed: this measure can identify skewed edge lengths (such as the L shape in the visual table); however, the values on real data rarely drop below 0.5 or rise above 0.8. Skewed seems to suffer from the same binning issue as sparse, and is also heavily influenced by the number of observations in the scatter plot.
Outlying: the definition of outlying points in the scagnostics literature is certainly strange. By definition an outlier must have all its adjacent MST edges above the outlying threshold, and the visual table displays three interesting cases of this. The first plot (outliers2) returns a 0 even though the handful of points in the top corner would likely be considered outliers by a human. This is because, within that group, the points are close enough that each has at least one edge below the outlying threshold. Even if we changed the measure such that only one edge needed to be above the threshold, it would only remove a single point. The l-shape shows an increasing spread of the points as they move away from the bottom left corner; the larger edge lengths therefore make sense within the distribution. Outlying does not take this into account, identifies a large number of the spread-out points as outliers, and removes them before computing the other scagnostics. The plot that scores highest on the outlying measure is, without question, a highly outlying distribution, yet the measure only returns 0.5; this again is due to the removal of binning as a pre-processing step.
Clumpy: the clumpy scagnostic probably suffers the most from the removal of the binning step. Because it is a ratio between an edge and its longest adjacent edge, it does not identify the largest edge, but rather an edge that is connected to an arbitrarily small edge. As a result, this scagnostic reliably returns an arbitrarily high value: scatter plots that actually contain clusters (such as clusters) score low on this measure, while a continuous variable plotted against a discrete variable scores arbitrarily high.
Stringy: This measure rarely drops below 0.5, even on data generated from a random normal distribution (which intuitively should score near 0). Unlike the other scagnostics on this list, stringy does not depend upon the edge lengths of the MST, so it is hard to say whether this issue stems from binning.
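The inflation of clumpy described above is easy to see numerically. Without binning, two near-duplicate points create an MST edge of near-zero length, and the edge ratio at its neighbour saturates regardless of whether any real clusters exist:

```r
# An ordinary MST edge next to a near-zero edge produced by two
# near-duplicate points: the 1 - (small / large) ratio saturates at ~1.
e_ordinary <- 0.5
e_tiny     <- 1e-6
ratio <- 1 - e_tiny / e_ordinary
ratio  # ~0.999998, read as "very clumpy" even with no clusters present
```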
The primary issue with some of these measures is that they do not range uniformly from 0 to 1. These measures still put the scatter plots in the correct order and so do not urgently need adjustment in order to work as intended. For other measures, removing the binning step completely changes the visual feature the measure identifies. For that reason we will only adjust the measures that no longer make sense without binning, and keep the measures that have a warped distribution (but correctly order the scatter plots) as they are. With these issues in mind, we have defined and written several new scagnostics that work even without the pre-processing step of binning.
The measures that need an adjusted version are striated, sparse, skewed, and clumpy. The outlying and stringy measures could possibly be left as they are, as they are not as badly damaged by the removal of binning.
The issues surrounding the striated scagnostic are:
By only counting vertices with two edges, the set of vertices counted in this measure is a subset of those counted in stringy, thus the two measures are highly correlated.
In order for a vertex to be counted, the angle between its edges needs to be approximately 135 to 220 degrees. The original idea seems to have been to identify the predominantly 180-degree angles that come with a discrete variable plotted against a continuous one; however, the large margin of error makes the measure almost identical to stringy.
To account for these two issues, the adjusted striated measure considers all vertices (not just those with two adjacent edges), and is strict around the 180- and 90-degree angles. With this we can see the improvements to the measure.
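The angle test itself reduces to a cosine between edge vectors at a shared vertex; the 180- and 90-degree thresholds are then applied to values like these. A small base-R sketch of that building block:

```r
# Cosine of the angle between two edges meeting at vertex v
edge_cosine <- function(v, a, b) {
  e1 <- a - v; e2 <- b - v
  sum(e1 * e2) / sqrt(sum(e1^2) * sum(e2^2))
}

c180 <- edge_cosine(c(0, 0), c(-1, 0), c(1, 0))  # -1: collinear edges (180 degrees)
c90  <- edge_cosine(c(0, 0), c(0, 1), c(1, 0))   #  0: a right angle (90 degrees)
```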
(#fig:Striated Vtable)A Visual Table Comparison of Striated and Striated 2
While these two measures may seem similar at a glance, a few things make the striated2 scagnostic an improvement on the striated scagnostic. First of all, the perfect value of 1 on striated goes to the “line” scatter plot. While this fulfils the definition, it is not what the measure is supposed to be looking for; it should instead identify the “vlines” scatter plot. Since striated does not count the right angles between the vertical lines, a truly striated plot will never score a full 1 on this measure; striated2 fixes this. After that there is a large gap in both measures because none of the other scatter plots have a strictly discrete variable on the x or y axis. The lower plots show that striated2 is also better at identifying discrete relationships with rotation and added noise, as shown in the “discrete” plot. In striated, “discrete” sits lower in the order than “outlying,” which would indicate that striated has finished identifying discreteness. In striated2, after the plots with strict discreteness (“vlines”) or strict rotated discreteness (“line”) comes the noisy and rotated “discrete” plot. Therefore, in terms of ordering the plots by how well they represent the feature of discreteness, striated2 outperforms striated.
The scagnostics need to be used and interpreted with the type of dataset in mind. If we are looking at a discrete dataset, a very low value on striated2 would indicate some strange relationship in the scatter plot. Since the old striated measure specifically looks for a continuous variable against a discrete variable, its highest values are also identified by striated2. The lowest values on striated identify plots where all the edges meet at right angles, once again a form of discreteness, but one that striated misses. Striated2 encapsulates both versions of discreteness in the plots that score exactly 1.
The issues that need to be addressed with the new clumpy measure are:
It needs to consider more than one edge in its final value, to make the measure more robust
The ratio between the long and short edges needs to be weighted by the size of the clusters involved, so the measure does not simply identify outliers
It should not count vertices whose adjacent edges form a straight line (to avoid identifying the angles striated identifies)
Before calculating a new clumpy measure, we looked into applying the robust version of the original clumpy measure defined in “Improving the Robustness of Scagnostics” (Wang et al. 2020). This version of clumpy is included in the package as “clumpy_r”; however, it is not offered as an option in the higher-level functions such as calc_scags() because its computation time is too long. The measure builds multiple clusters, each with its own clumpy value, and returns the weighted sum, where each value is weighted by the number of observations in its cluster. This version of clumpy spreads the scatter plots more evenly between 0 and 1 and is more robust to outliers; however, without the assistance of binning it does a poor job of ordering plots generally considered to be clumpy. Since this scagnostic cannot be used in large-scale scagnostic calculations (such as those done on every pairwise combination of variables, as intended by the package) and it retains the ordering issue of the original measure, it is not discussed further here.
Therefore, in order to fix the issues in the clumpy measure described above, we designed an adjusted clumpy measure, called clumpy2 in the package, which is calculated as follows:
With this calculation, we generate the clumpy2 measure, which is compared to the original clumpy measure in the figure below.
(#fig:Clumpy Vtable)A Visual Table Comparison of Clumpy and Clumpy 2
Here we can see the improvements to the clumpy measure in both its distribution from 0 to 1 and its ordering. The measure is more spread out, so values range more evenly from 0 to 1. More importantly, it does a better job of ordering the scatter plots: on the original clumpy measure the “clusters” scatter plot was next to last, while clumpy2 identifies “clusters” as the most clumpy scatter plot. Clumpy2 also penalises uneven clusters (to avoid scoring highly due to a small collection of outliers) and clusters created arbitrarily by discreteness (such as vlines), in order to better align with the human interpretation of clumpiness. With these changes, the stronger performance of clumpy2 is apparent in this visual table.
The package can be installed using
install.packages("cassowaryr")
from CRAN and using
remotes::install_github("numbats/cassowaryr")
installs the development version.
More documentation of the package can be found at the web site https://numbats.github.io/cassowaryr/.
The cassowaryr package comes with several data sets that load with the package; they are described here.
| dataset | description |
|---|---|
| features | Simulated data with special features. |
| anscombe_tidy | Data from Anscombe's famous example in tidy format. |
| datasaurus_dozen | Datasaurus Dozen data in a long tidy format. |
| datasaurus_dozen_wide | Datasaurus Dozen Data in a wide tidy format. |
| numbat | A toy data set with a numbat shape hidden among noise variables. |
| pk | Parkinson's data from the UCI machine learning archive. |
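Assuming the package is installed, the bundled data loads lazily with the package, for example:

```r
library(cassowaryr)
head(features)  # the simulated feature data used in the examples below
```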
The scagnostics functions either calculate each scagnostic measure, or are involved in the process of calculating one (such as making the hull objects). These are low-level functions; while they are exported and can be used, they are not the intended method of calculating scagnostics, as they perform no outlier removal. They remain an option for users who wish to use them, and in some cases, such as sc_clumpy_r for robust clumpy, they are the only way to calculate that scagnostic. The functions in this group are:
| func_name | description |
|---|---|
| scree | Generates a scree object that contains the Delaunay triangulation of the scatter plot. |
| sc_clumpy | Compute the original clumpy scagnostic measure. |
| sc_clumpy2 | Compute adjusted clumpy scagnostic measure. |
| sc_clumpy_r | Compute robust clumpy scagnostic measure. |
| sc_convex | Compute the original convex scagnostic measure. |
| sc_dcor | Compute the distance correlation index. |
| sc_monotonic | Compute the Spearman correlation. |
| sc_outlying | Compute the original outlying scagnostic measure. |
| sc_skewed | Compute the original skewed scagnostic measure. |
| sc_skinny | Compute the original skinny scagnostic measure. |
| sc_sparse | Compute the original sparse scagnostic measure. |
| sc_sparse2 | Compute adjusted sparse measure. |
| sc_splines | Compute the spline based index. |
| sc_striated | Compute the original striated scagnostic measure. |
| sc_striated2 | Compute angle adjusted striated measure. |
| sc_stringy | Compute stringy scagnostic measure. |
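For instance, assuming the package is installed, a low-level call takes two numeric vectors directly (recall that no outlier removal happens at this level):

```r
library(cassowaryr)
# Squared Spearman correlation of Anscombe's first pair
sc_monotonic(datasets::anscombe$x1, datasets::anscombe$y1)
```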
The drawing functions are intended to be used by the user to better understand the results of the scagnostic functions. The input is two numeric vectors and the output is a ggplot object that draws the graph-based object in question. The functions in this group are:
| func_name | description |
|---|---|
| draw_alphahull | Drawing the alpha hull. |
| draw_convexhull | Drawing the convex hull. |
| draw_mst | Drawing the MST. |
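As with the scagnostic functions, the drawing helpers take two numeric vectors; assuming the package is installed, the returned ggplot object can be printed or customised further:

```r
library(cassowaryr)
draw_mst(datasets::anscombe$x1, datasets::anscombe$y1)
```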
The summary functions are the preferred method for users to calculate scagnostics. The calc_scags function is designed to be used on long-form data with the dplyr group_by and summarise functions. The calc_scags_wide function is designed to take a tibble of numeric variables and return the scagnostics on every possible pairwise scatter plot. Both functions return a tibble where each column is a scagnostic. These are the two main functions of the package.
| func_name | description |
|---|---|
| calc_scags | Compute selected scagnostics on subsets. |
| calc_scags_wide | Compute scagnostics on all possible scatter plots for the given data. |
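A minimal sketch of the wide form, assuming the package is installed: a data frame of numeric columns goes in, and one row of scagnostics per variable pair comes out.

```r
library(cassowaryr)
calc_scags_wide(datasets::anscombe[, c("x1", "y1", "y2")])  # 3 variable pairs
```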
The main arguments of the calc_scags function are shown in **Table __**.
| argument | description |
|---|---|
| x | numeric vector of x values. |
| y | numeric vector of y values. |
| scags | collection of strings matching names of scagnostics to calculate: outlying, stringy, striated, striated2, striped, clumpy, clumpy2, sparse, skewed, convex, skinny, monotonic, splines, dcor. The default is to calculate all scagnostics. |
While the calc_scags function does not take a tibble, it is designed to integrate seamlessly into a tidy data workflow. Currently, specifying the scagnostics inside the summarise function does not work correctly; however, until it is fixed, we can work around this by computing the scagnostics and then selecting the desired columns. Using the code below, we can calculate a specified list of scagnostics on the scatter plots from the features data, producing the output shown in **Table __**.
library(cassowaryr)
library(dplyr)

features_scags <- features %>%
  group_by(feature) %>%
  summarise(calc_scags(x, y)) %>%
  select(c(feature, outlying, clumpy2, monotonic))
| feature | outlying | clumpy2 | monotonic |
|---|---|---|---|
| barrier | 0.0000000 | 0.0000000 | 0.3480849 |
| clusters | 0.0551486 | 0.8340918 | 0.0343677 |
| discrete | 0.0000000 | 0.0000000 | 0.0082035 |
| disk | 0.0185229 | 0.2237189 | 0.0855451 |
| gaps | 0.0000000 | 0.7499695 | 0.0572168 |
| l-shape | 0.3849340 | 0.0000000 | 0.4797960 |
| line | 0.1137124 | 0.0000000 | 1.0000000 |
| nonlinear1 | 0.2715451 | 0.0000000 | 0.1684822 |
| nonlinear2 | 0.0000000 | 0.0000000 | 0.8094209 |
| outliers | 0.0000000 | 0.0000000 | 0.7051628 |
| outliers2 | 0.5906739 | 0.0000000 | 0.0600126 |
| positive | 0.1380011 | 0.1805457 | 0.9206001 |
| ring | 0.0189650 | 0.3666103 | 0.0446935 |
| vlines | 0.0000000 | 0.1292846 | 0.0824344 |
| weak | 0.0491130 | 0.0000000 | 0.4080648 |
We are also considering an additional two summary functions that could be introduced to the package. While the code required to write them is simple and easily performed by the user, having them as ready-made functions in the package would help guide users to use the package most effectively. The two additional functions that have not yet been implemented are calc_topscags and calc_toppairs, described in **Table __**.
| func_name | description |
|---|---|
| calc_topscags | Return the top scagnostic value for each pair of variables |
| calc_toppairs | Return the top pair of variables for each scagnostic |
The code for both is simple, but an example of how to compute calc_toppairs, with its output, is shown here. In this case it operates on the top groups, but the main idea is the same as if we had used calc_scags_wide to generate pairs of variables.
library(dplyr)
library(tidyr)

features_toppairs <- features_scags %>%
  pivot_longer(!feature, names_to = "scag", values_to = "value") %>%
  arrange(desc(value)) %>%
  group_by(scag) %>%
  slice_head(n = 1)
| feature | scag | value |
|---|---|---|
| clusters | clumpy2 | 0.8340918 |
| line | monotonic | 1.0000000 |
| outliers2 | outlying | 0.5906739 |
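The complementary summary, the proposed calc_topscags, is equally short; here is a sketch on a small hand-made table so it stands alone (the values are made up for illustration):

```r
library(dplyr)
library(tidyr)

# Toy stand-in for a table of scagnostic results (illustrative values)
scags_tbl <- tibble::tribble(
  ~feature,    ~outlying, ~clumpy2,
  "clusters",  0.06,      0.83,
  "outliers2", 0.59,      0.00
)

# Top-scoring scagnostic for each feature
res <- scags_tbl %>%
  pivot_longer(-feature, names_to = "scag", values_to = "value") %>%
  group_by(feature) %>%
  slice_max(value, n = 1)
res
```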
All the functions that calculate the scagnostic measures (those whose names start with “sc”) have tests written and implemented using the testthat package. They have all been compared to calculations completed by hand, to ensure that differences in results from previous literature are due to other steps in the process, such as binning, and not to a mistake in the code. These tests also illuminated issues that allowed us to make meaningful changes to the definitions of the scagnostics and the implementation of the package. For example, several tests checking that the outlying scagnostic was working correctly revealed issues in the outlier-removal process, which is illustrated in **Figure __**.
(#fig:Outlying test plot)Outlying Test Plots
**Figure __** below shows an example of a simulated test set, together with the associated MST. When creating this test data set, we assumed the MST would connect via the red line shown on the left plot of **Figure __**, but instead the MST connected via the long black line. The choice between these edges is essentially random, as they are exactly the same length, but it has significant implications for the value returned by the outlying scagnostic. This test was designed to check the outlier-removal process for internal outliers: point 1 should be identified as an internal outlier, meaning both its edges are considered in the calculation of outlying, while points 2 and 3 are too close to each other for either to fulfil the outlying definition, so they are left alone. Using the draw_mst function when the test failed showed the issue was an essentially random decision in the MST construction. If the red line had been used to construct the MST, both the red dashed line and the line connecting point 1 to point 2 would be included in the outlying calculation; in the actual calculation only the edge between points 1 and 2 was included, giving a significantly smaller value on the outlying scagnostic. This shows that even the scagnostics that work reliably well and did not need significant adjustment are still susceptible to arbitrarily large changes resulting from seemingly small changes in the visual structure of the scatter plot.
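For flavour, a sketch of what such a test looks like, assuming the package and testthat are installed (the check shown is illustrative; the real tests compare against hand-computed values):

```r
library(testthat)
library(cassowaryr)

test_that("monotonic equals the squared Spearman correlation", {
  x <- c(1, 2, 3, 4, 5)
  y <- c(2, 1, 4, 3, 5)
  expect_equal(sc_monotonic(x, y), cor(x, y, method = "spearman")^2)
})
```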
The Australian Football League Women’s (AFLW) is the national semi-professional Australian Rules football league for female players. Here we analyse data sourced from the official AFL website with information on the 2020 season, in which the league had 14 teams and 1932 players. The variables are recorded per player per game, so the statistics are averaged for each player over the course of the season. A description of each statistic can be found in the Appendix. There are 68 variables, 33 of which are numeric; the others are categorical (e.g. player names or match IDs) and are not used in the scagnostic calculations. This means there are 528 possible scatter plots, significantly more than a single person could view and analyse themselves, so we use scagnostics to identify which pairwise plots might be interesting to examine.
**Figure __** displays five scatter plots (Plots 1 to 5 in the figure) that were identified as having a particularly high or low value on a scagnostic, or an unusual combination of two or more scagnostics. In addition to these five, a sixth plot (Plot 6 in the figure) is included to display what a middling value on almost all of the scagnostics looks like.
Most scatter plots score middling values on the scagnostics, so Plot 6 is a good indication of what we would see if we picked variables to plot ourselves with no intuition. The visual structure that changes significantly between Plots 1 to 5, and the lack of interesting visual features in Plot 6, show the benefit of using scagnostics in the early stages of exploratory data analysis: extreme values on the scagnostic measures identify atypical scatter plots.
The best way to identify interesting scatter plots using scagnostics is to construct a large interactive SPLOM. This is how Plots 1 to 5 were identified, but for the sake of space we only show the specific scatter plots of the SPLOM that led to the selection of Plots 1, 2, and 5.
**Figure __** displays Plots 1, 2 and 5 beneath the specific scatter plot of the scagnostics SPLOM that was used to identify each as interesting. Plot 1 was identified as interesting because it returned high values on both outlying and skewed. Intuitively, this indicates that even after removing outliers the data was still disproportionately spread out, a visual feature we can see very clearly in Plot 1. Plot 2 scored very highly on all the association measures, which indicates a strong relationship between the two variables. The three association measures are typically strongly correlated: scatter plots that stay within the large mass in the centre have a linear relationship, while those that do not often have a non-linear relationship. The splines vs dcor plot tells us there is a strong linear relationship between total possessions and disposals. Total possessions is the number of times a player has the ball and disposals is the number of times a player legally gets rid of it, so the strong linear relationship indicates the level of play, i.e. few mistakes are made in a professional league. Plot 5 is an excellent example of the new information we can learn from a unique plot identified with scagnostics. This plot is high on striated2 and moderate to low on outlying, telling us most of the points will sit at straight or right angles and be a little spread out. If a specific sports statistic were related to position, we would see a relationship with a lower-triangular structure similar to that of Plot 4; however, this plot does not have a lower-triangular structure, it has an L-shape. This means these statistics are not about position, but rather the physical abilities of the players. Hitouts measure the number of times a player punches the ball after the umpire throws it back into play, while bounces must be done while running and are typically done by fast players. The L-shape tells us that players who do one very rarely perform the other.
The moderate spread along both statistics tells us that these are both somewhat specialised skills, and that players who specialise in one do not specialise in the other, i.e. in AFL the tallest player in the team is rarely the fastest. These plots provide a clear example of the unique information gained by using scagnostics as a tool in exploratory data analysis.
Physics data often contains multiple variables with highly non-linear pairwise relationships and a large amount of clustering, which makes this type of data ideal for investigating the capability of the splines and clumpy2 scagnostics, two measures whose uses were not particularly visible in the AFLW example. Here we use scagnostics to study a simulated dataset that contains posterior samples describing a gravitational wave signal from a black hole merger. The data contains 13 variables that describe _____(what event)___; an explanation of each can be found in the appendix. In this example looking at the complete SPLOM is still feasible, and it could be used to identify several interesting scatter plots and the corresponding combinations of variables. Most notably, we can see non-linear and non-functional relationships between pairs of variables, and we expect that these should stand out on the scagnostics measures as well. The full data file contains 9998 posterior samples; without binning, computing the scagnostics on such a large number of observations is too slow. For our purpose a much smaller sample is sufficient, so we randomly sample 200 observations before computing the scagnostics. We will focus on the structures we know exist by looking at which scatter plots show a significant difference between their splines and dcor values, as well as which plots stand out on the clumpy2 measure.
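The subsampling step can be sketched as follows. This assumes the posterior samples are in a data frame `bbh_samples` (9998 rows, 13 parameter columns); the seed is arbitrary and only fixes the subsample for reproducibility:

```r
library(cassowaryr)
set.seed(2022)  # arbitrary seed, for a reproducible subsample

# Draw 200 posterior samples at random before computing scagnostics,
# restricted to the measures discussed in this section
bbh_sub <- bbh_samples[sample(nrow(bbh_samples), 200), ]
scags_bbh <- calc_scags_wide(
  bbh_sub,
  scags = c("convex", "skinny", "splines", "dcor", "clumpy", "clumpy2")
)
```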
Figure 2: Selected pairs of scagnostics computed for the black hole mergers data. Groups of parameter combinations can be seen to stand out in the left plot (high on skinny and low on convex) and in the middle plot (high on both dcor and splines). The plot on the right shows clumpy vs clumpy2, where we can see the big impact of the correction for this dataset.
Figure 2 shows scatterplots of the computed scagnostics measures, where several combinations stand out. On the left plot we see three points with very low values of the convex measure and high values of skinny. These are all possible combinations containing the variables time, ra and dec, and the corresponding scatterplots are shown in the upper row of Figure 3. Because of how a single experiment observes the sky, there is an interesting pattern between these variables, with posterior samples being drawn along a non-linear band in this three-dimensional space.
These variables also stand out in the middle plot of Figure 2, where it is interesting to note that the combinations with a non-linear but functional relationship (time vs ra and dec vs ra) have somewhat higher values in the splines measure compared to dcor. On the other hand, dec vs time does not exhibit a functional relationship, and consequently gets a higher dcor score compared to splines (with both measures still taking large values). This also happens for two other combinations, m1 vs m2 and chi_p vs chi_tot, which are shown in the bottom row (left and middle) of Figure 3. Both of these combinations show noisy linear relationships.
Another interesting aspect of this dataset is that several combinations lead to visible separations between groups of points, making it an ideal test case for our new implementation of clumpy2. The right plot in Figure 2 shows clumpy vs clumpy2, and reveals large differences between the two measures. In particular, there are many combinations without visible clustering that still score high on clumpy, but for which clumpy2 is zero. On the other hand, several combinations that do lead to visible separation between groups stand out in terms of clumpy2, but not the original clumpy. One example is time vs alpha, shown in the bottom right plot of Figure 3.
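Disagreements between the two measures are easy to pull out of the scagnostics table directly. A minimal sketch, assuming `scags_bbh` is a data frame of scagnostics for the merger data with columns named `clumpy` and `clumpy2`:

```r
library(dplyr)

# Rank variable pairs by how strongly the original and corrected
# clustering measures disagree; pairs like time vs alpha should
# rise to the top
scags_bbh %>%
  mutate(clumpy_gap = abs(clumpy2 - clumpy)) %>%
  arrange(desc(clumpy_gap))
```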
Figure 3: Features in the BBH data that stand out on several of the scagnostics measures (convex, skinny, splines and dcor), showing strong relationships between variables including non-linear and non-functional dependencies. The final example (time vs alpha) is expected to take high values in clumpy, but only stands out on the corrected clumpy2.
A potential application of scagnostics is to detect shape differences between groups. Classification commonly focuses on differences in means, or separations between groups; there are few techniques that focus on differences in shape. A difference in shape occurs when the variance patterns of the groups differ, and quadratic discriminant analysis (QDA) is a classical example of a method that takes this difference in variance into consideration. QDA assumes the distribution of each group is normal, and then draws a curved boundary between them that is furthest from each group's mean, while respecting that one group might have a larger elliptical variance-covariance than another. While this method is useful when groups have different shapes, it is still limited by the assumption of normality. Scagnostics could be utilised in a similar fashion to QDA to identify irregular shape differences between groups.
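For contrast with the scagnostics-based idea, a standard QDA fit in R takes only a few lines; this is purely illustrative, using the built-in `iris` data rather than anything from this paper:

```r
library(MASS)  # provides qda()

# Fit a quadratic boundary between species using two measurements;
# each species gets its own variance-covariance matrix
fit <- qda(Species ~ Sepal.Length + Sepal.Width, data = iris)

# Class predictions come from the posterior under each group's
# (assumed normal) distribution
head(predict(fit, iris)$class)
```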
This analysis compares the features of two large collections of time series, and then tries to differentiate them using scagnostics. The goal is to compare shapes, not necessarily the centres of groups as might be done in LDA or other machine learning methods. The two groups chosen for comparison are macroeconomic and microeconomic series. The data is pulled from the self-organizing database of time-series data (Fulcher et al. 2020), using the compenginets R package (Hyndman and Yang 2021). Since the time series have different lengths, each is described by a set of time series features (chapter 4 of Forecasting: Principles and Practice, 3rd Edition 2021) using the feasts R package (O'Hara-Wild, Hyndman, and Wang 2021).
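The feature-extraction step can be sketched as follows. This assumes the downloaded series have been collected into a tsibble `econ_ts` with a measured column `value`, a series identifier as the key, and a `group` column marking each series as macro or micro:

```r
library(feasts)      # time series features
library(fabletools)  # features() / feature_set()
library(dplyr)

# One row of features per series; different-length series all map
# to the same fixed-length feature vector
ts_feats <- econ_ts %>%
  features(value, feature_set(pkgs = "feasts"))

# Scagnostics are then computed separately within each group,
# so their values can be compared between macro and micro
```

With a feature table per group, `calc_scags_wide()` can be run on each subset and the per-scagnostic differences tabulated, which is how the pairs in the table below were selected.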
For illustration, only a small set of features is examined, but still enough that the list of scatter plots identified by the scagnostics is significantly smaller than the list of all possible scatter plots. **Table __** shows the pair of features that maximises the difference between groups for each scagnostic. Plotting a handful of these in Figure 4, we can see the differences in shape that the scagnostics have identified. For example, the comparison of the curvature and trend strength features shows that both types of time series have, on average, strong trends and moderate curvature; however, the former varies more in the macroeconomic series and the latter in the microeconomic series. We can see from this example that the scagnostics have identified a difference in shape that is not apparent in the mean of the data. Similar comments can be made about the other two plots in Figure 4.
While we have shown that scagnostics succeed in identifying differences in shape between groups, this does not automatically transfer to a classification technique. Utilising the scagnostics' ability to identify between-group shape differences is only an early step in using them for classification. It is not uncommon for supervised learning methods to be born from unsupervised ones: for example, principal component analysis transforms a dataset by taking linear combinations of the original variables in the directions of most variance, and using these transformed variables in a linear regression can improve results. However, despite its promise, developing a classification technique is beyond the scope of this research.
| Feature 1 | Feature 2 | Scagnostic | Macro value | Micro value | Difference |
|---|---|---|---|---|---|
| pacf5 | linearity | clumpy2 | 0.7830480 | 0.0000000 | 0.7830480 |
| longest_flat_spot | trend_strength | convex | 0.1202349 | 0.6212495 | 0.5010146 |
| pacf5 | diff1_acf1 | outlying | 0.3206680 | 0.7093774 | 0.3887095 |
| curvature | trend_strength | skewed | 0.6555525 | 0.8438398 | 0.1882874 |
| longest_flat_spot | trend_strength | skinny | 0.6436844 | 0.3716400 | 0.2720444 |
| acf1 | trend_strength | sparse | 0.0370531 | 0.1081571 | 0.0711040 |
| pacf5 | acf1 | splines | 0.8751372 | 0.0000000 | 0.8751372 |
| longest_flat_spot | diff1_acf1 | striated2 | 0.1290323 | 0.0645161 | 0.0645161 |
| diff1_acf1 | trend_strength | stringy | 0.8421053 | 0.7272727 | 0.1148325 |
Figure 4: Interesting differences between two groups of time series detected by scagnostics. The time series are described by time series features, in order to handle different length series. Scagnostics are computed on these features separately for each set to explore for shape differences.
The World Bank provides a large number of development indicators (World Bank 2021), for many countries and multiple years. The sheer volume of indicators, in addition to substantial missingness, creates a barrier to analysis. This is a good example of where scagnostics can be used to identify pairs of indicators with interesting relationships.
Here we have downloaded indicators from 2018 for a number of countries. First, the data needs some pre-processing to remove variables and countries with mostly missing values. The scagnostics will be calculated on the pairwise complete data, so it is fine to leave a few sporadic missing values. At the end of the pre-processing, there are 20 indicators for 79 countries.
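The pre-processing can be sketched with base R. This assumes the raw 2018 indicators are in a numeric data frame `wdi2018` with countries in rows and indicators in columns; the 30% missingness threshold is an illustrative choice, not the exact cutoff used here:

```r
# Drop indicators (columns) that are mostly missing,
# then countries (rows) that are mostly missing
keep_var <- colMeans(is.na(wdi2018)) < 0.3   # illustrative threshold
wdi_sub  <- wdi2018[, keep_var]

keep_cty <- rowMeans(is.na(wdi_sub)) < 0.3
wdi_sub  <- wdi_sub[keep_cty, ]

# Remaining sporadic NAs are handled by the pairwise-complete
# scagnostics calculations
```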
Figure 5: Most of the pairs of indicators exhibit outliers or are stringy. There is one pair that has clumpy as the highest value. There are numerous pairs that have their highest value on convex.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".